Initial NLP Work on Film Scripts

Author

Mick Cooney

Published

January 17, 2023

In this workbook we perform the initial NLP pre-processing and simple explorations and visualisations of the data.

1 Load Data

Show code
films_master_tbl <- read_rds("data/films_master_tbl.rds")

films_master_tbl |> glimpse()
Rows: 48
Columns: 9
$ film_title      <chr> "12 Years a Slave", "2001 A Space Odyssey", "Airplane"…
$ release_year    <int> 2013, 1968, 1977, 1979, 1997, 1955, 1974, 1990, 2012, …
$ genre           <chr> "Drama", "Science Fiction", "Comedy", "War", "Comedy",…
$ title_cleaned   <chr> "12_years_a_slave", "2001_a_space_odyssey", "airplane"…
$ cached_htmlfile <glue> "scraped_files/scraped_raw/scraped_raw_12_years_a_sla…
$ script_txtfile  <glue> "scraped_files/scraped_raw/script_12_years_a_slave.tx…
$ cleaned_txtfile <glue> "scraped_files/cleaned_script/cleanedscript_12_years_…
$ flag_cleaned    <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,…
$ parsed_file     <glue> "data/parsed_scripts/parsedscript_12_years_a_slave.rd…

The parsed data is stored in the file listed in the column parsed_file and contains to separate tibbles, one with the detailed parsing of the script and the second which aggregates all the text for both scene directions and dialogue.

Also, a number of film scripts have not parsed properly, so we also want to create a list of those films and exclude them from the analysis.

We may go back later to improve the parsing and if this happens we will updated this list.

Show code
films_exclude_tbl <- c(
    "12_years_a_slave", "2001_a_space_odyssey", "django_unchained",
    "donnie_brasco", "drive", "gran_torino", "leaving_las_vegas",
    "lock_stock_and_two_smoking_barrels", "moneyball", "office_space",
    "star_wars_return_of_the_jedi", "the_green_mile"
    ) |>
  enframe(name = NULL, value = "title_cleaned")

films_exclude_tbl |> glimpse()
Rows: 12
Columns: 1
$ title_cleaned <chr> "12_years_a_slave", "2001_a_space_odyssey", "django_unch…
Show code
films_parsed_tbl <- films_master_tbl |>
  anti_join(films_exclude_tbl, by = "title_cleaned") |>
  mutate(
    parsed_data = map(parsed_file, read_rds)
    ) |>
  unnest(parsed_data) |>
  select(
    film_title, release_year, genre, title_cleaned, parsing_detailed,
    parsing_aggregated
    )

films_parsed_tbl |> glimpse()
Rows: 36
Columns: 6
$ film_title         <chr> "Airplane", "Apocalypse Now", "Austin Powers - Inte…
$ release_year       <int> 1977, 1979, 1997, 1955, 1974, 1990, 1988, 1997, 201…
$ genre              <chr> "Comedy", "War", "Comedy", "Western", "Crime", "Wes…
$ title_cleaned      <chr> "airplane", "apocalypse_now", "austin_powers_intern…
$ parsing_detailed   <list> [<tbl_df[5646 x 13]>], [<tbl_df[8014 x 13]>], [<tb…
$ parsing_aggregated <list> [<tbl_df[1245 x 5]>], [<tbl_df[1388 x 5]>], [<tbl_…

2 Initial NLP Processing

We now want to perform some very basic NLP processing such as tokenisation.

Once we have tokenised the script, we also remove “stop words” - that is, common words that do not convey meaning, such as “and”, “to”, “the” and so on.

Show code
data(stop_words)

films_tokens_tbl <- films_parsed_tbl |>
  mutate(
    wordtoken_data = map(
      parsing_aggregated, unnest_tokens,
      output = word, input = trimmed_text
      ),
    ngramtoken_data = map(
      parsing_aggregated, unnest_tokens,
      output = word, input = trimmed_text, token = "ngrams", n = 2, n_min = 1
      )
    ) |>
  select(-parsing_detailed, -parsing_aggregated)

films_wordtoken_unstopped_tbl <- films_tokens_tbl |>
  select(-ngramtoken_data) |>
  unnest(wordtoken_data) |>
  select(-full_text)

films_wordtoken_tbl <- films_wordtoken_unstopped_tbl |>
  anti_join(stop_words, by = "word")

2.1 Show Initial Wordclouds

We now want to create some word clouds as a quick initial visualisation of the data.

Show code
plot_unstopped_tbl <- films_wordtoken_unstopped_tbl |>
  count(word) |>
  slice_max(order_by = n, n = 500)

ggwordcloud2(plot_unstopped_tbl, size = 4, seed = 421)

2.2 Word-stemming

We also want to look at stemming our words.

Show code
films_stems_tbl <- films_wordtoken_tbl |>
  mutate(
    snowball_stem = wordStem(word),
    hunspell_stem = hunspell_stem(word)
    )

films_stems_tbl |> glimpse()
Rows: 339,670
Columns: 10
$ film_title    <chr> "Airplane", "Airplane", "Airplane", "Airplane", "Airplan…
$ release_year  <int> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 19…
$ genre         <chr> "Comedy", "Comedy", "Comedy", "Comedy", "Comedy", "Comed…
$ title_cleaned <chr> "airplane", "airplane", "airplane", "airplane", "airplan…
$ section_title <chr> "Written by", "Written by", "Written by", "Written by", …
$ grouping_id   <int> 2, 2, 2, 2, 2, 2, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ flag_dialogue <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ word          <chr> "jim", "abrahams", "david", "zucker", "jerry", "zucker",…
$ snowball_stem <chr> "jim", "abraham", "david", "zucker", "jerri", "zucker", …
$ hunspell_stem <list> <>, <>, <>, <>, <>, <>, "fade", "ext", "sky", "cloud", …

We also create a word cloud of this stemmed data

Show code
plot_stemmed_tbl <- films_stems_tbl |>
  count(word = snowball_stem) |>
  slice_max(order_by = n, n = 500)

ggwordcloud2(plot_stemmed_tbl, size = 2, seed = 422)

2.3 Contrasting Dialogue and Direction

Finally we look just at the words in the lines of dialogue and focus on this.

Show code
plot_dialogue_tbl <- films_wordtoken_tbl |>
  filter(flag_dialogue == TRUE) |>
  count(word) |>
  slice_max(order_by = n, n = 500)

ggwordcloud2(plot_stemmed_tbl, size = 2, seed = 422)

3 Sentiment Analysis

Sentiment analysis takes the simple approach of assigning some kind of measure of sentiment or emotion to each word, allowing us to quantify these concepts in the text in various ways.

Note that this approach is simplistic: it does not consider context or anything beyond the presence of each word, but it is a quick and simple thing to look at.

There are a number of different sets of sentiment data, so we

Show code
sentiments_afinn_tbl    <- get_sentiments("afinn")
sentiments_afinn_tbl    |> glimpse()
Rows: 2,477
Columns: 2
$ word  <chr> "abandon", "abandoned", "abandons", "abducted", "abduction", "ab…
$ value <dbl> -2, -2, -2, -2, -2, -2, -3, -3, -3, -3, 2, 2, 1, -1, -1, 2, 2, 2…
Show code
sentiments_bing_tbl     <- get_sentiments("bing")
sentiments_bing_tbl     |> glimpse()
Rows: 6,786
Columns: 2
$ word      <chr> "2-faces", "abnormal", "abolish", "abominable", "abominably"…
$ sentiment <chr> "negative", "negative", "negative", "negative", "negative", …
Show code
sentiments_loughran_tbl <- get_sentiments("loughran")
sentiments_loughran_tbl |> glimpse()
Rows: 4,150
Columns: 2
$ word      <chr> "abandon", "abandoned", "abandoning", "abandonment", "abando…
$ sentiment <chr> "negative", "negative", "negative", "negative", "negative", …
Show code
sentiments_nrc_tbl      <- get_sentiments("nrc")
sentiments_nrc_tbl      |> glimpse()
Rows: 13,872
Columns: 2
$ word      <chr> "abacus", "abandon", "abandon", "abandon", "abandoned", "aba…
$ sentiment <chr> "trust", "fear", "negative", "sadness", "anger", "fear", "ne…

3.1 Visualising the NRC Sentiments

We use the NRC sentiments and the count the appearance of each emotion in this dataset.

Show code
plot_sentiments_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_nrc_tbl, by = "word") |>
  count(title_cleaned, sentiment)

ggplot(plot_sentiments_tbl) +
  geom_tile(
    aes(x = title_cleaned %>% str_trunc(20), y = sentiment, fill = n)
    ) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(
    x = "Film Title",
    y = "Sentiment",
    fill = "Raw Count",
    title = "Sentiments in Film Scripts"
    ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5))

Raw counts are interesting, but it is also worth looking at scaling these counts by the total word count of the script, and then plot each of those counts as a ratio of the total word count in the script.

Show code
films_wordcount_tbl <- films_wordtoken_tbl |>
  count(title_cleaned, name = "total_count")

plot_sentiments_ratio_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_nrc_tbl, by = "word") |>
  count(title_cleaned, sentiment, name = "word_count") |>
  inner_join(films_wordcount_tbl, by = "title_cleaned") |>
  mutate(word_ratio = word_count / total_count)
  
ggplot(plot_sentiments_ratio_tbl) +
  geom_tile(
    aes(x = title_cleaned %>% str_trunc(20), y = sentiment, fill = word_ratio)
    ) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(
    x = "Film Title",
    y = "Sentiment",
    fill = "Ratio",
    title = "Sentiments in Film Scripts"
    ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5))

4 R Environment

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 20.04.5 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Etc/UTC
 date     2023-01-17
 pandoc   2.19.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 assertthat      0.2.1   2019-03-21 [1] RSPM (R 4.2.0)
 backports       1.4.1   2021-12-13 [1] RSPM (R 4.2.0)
 broom           1.0.1   2022-08-29 [1] RSPM (R 4.2.0)
 cachem          1.0.6   2021-08-19 [1] RSPM (R 4.2.0)
 cellranger      1.1.0   2016-07-27 [1] RSPM (R 4.2.0)
 cli             3.4.1   2022-09-23 [1] RSPM (R 4.2.0)
 codetools       0.2-18  2020-11-04 [2] CRAN (R 4.2.1)
 colorspace      2.0-3   2022-02-21 [1] RSPM (R 4.2.0)
 conflicted    * 1.1.0   2021-11-26 [1] RSPM (R 4.2.0)
 cowplot       * 1.1.1   2020-12-30 [1] RSPM (R 4.2.0)
 crayon          1.5.2   2022-09-29 [1] RSPM (R 4.2.0)
 DBI             1.1.3   2022-06-18 [1] RSPM (R 4.2.0)
 dbplyr          2.2.1   2022-06-27 [1] RSPM (R 4.2.0)
 digest          0.6.30  2022-10-18 [1] RSPM (R 4.2.0)
 dplyr         * 1.0.10  2022-09-01 [1] RSPM (R 4.2.0)
 ellipsis        0.3.2   2021-04-29 [1] RSPM (R 4.2.0)
 evaluate        0.17    2022-10-07 [1] RSPM (R 4.2.0)
 fansi           1.0.3   2022-03-24 [1] RSPM (R 4.2.0)
 farver          2.1.1   2022-07-06 [1] RSPM (R 4.2.0)
 fastmap         1.1.0   2021-01-25 [1] RSPM (R 4.2.0)
 fastmatch       1.1-3   2021-07-23 [1] RSPM (R 4.2.0)
 forcats       * 0.5.2   2022-08-19 [1] RSPM (R 4.2.0)
 fs            * 1.5.2   2021-12-08 [1] RSPM (R 4.2.0)
 furrr         * 0.3.1   2022-08-15 [1] RSPM (R 4.2.0)
 future        * 1.28.0  2022-09-02 [1] RSPM (R 4.2.0)
 gargle          1.2.1   2022-09-08 [1] RSPM (R 4.2.0)
 generics        0.1.3   2022-07-05 [1] RSPM (R 4.2.0)
 ggplot2       * 3.3.6   2022-05-03 [1] RSPM (R 4.2.0)
 ggwordcloud   * 0.5.0   2019-06-02 [1] RSPM (R 4.2.0)
 globals         0.16.1  2022-08-28 [1] RSPM (R 4.2.0)
 glue          * 1.6.2   2022-02-24 [1] RSPM (R 4.2.0)
 googledrive     2.0.0   2021-07-08 [1] RSPM (R 4.2.0)
 googlesheets4   1.0.1   2022-08-13 [1] RSPM (R 4.2.0)
 gtable          0.3.1   2022-09-01 [1] RSPM (R 4.2.0)
 haven           2.5.1   2022-08-22 [1] RSPM (R 4.2.0)
 hms             1.1.2   2022-08-19 [1] RSPM (R 4.2.0)
 htmltools       0.5.3   2022-07-18 [1] RSPM (R 4.2.0)
 htmlwidgets     1.5.4   2021-09-08 [1] RSPM (R 4.2.0)
 httr            1.4.4   2022-08-17 [1] RSPM (R 4.2.0)
 hunspell      * 3.0.2   2022-09-04 [1] RSPM (R 4.2.0)
 janeaustenr     1.0.0   2022-08-26 [1] RSPM (R 4.2.0)
 jsonlite        1.8.3   2022-10-21 [1] RSPM (R 4.2.0)
 knitr           1.40    2022-08-24 [1] RSPM (R 4.2.0)
 labeling        0.4.2   2020-10-20 [1] RSPM (R 4.2.0)
 lattice         0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
 lifecycle       1.0.3   2022-10-07 [1] RSPM (R 4.2.0)
 listenv         0.8.0   2019-12-05 [1] RSPM (R 4.2.0)
 lubridate       1.8.0   2021-10-07 [1] RSPM (R 4.2.0)
 magrittr      * 2.0.3   2022-03-30 [1] RSPM (R 4.2.0)
 Matrix          1.4-1   2022-03-23 [2] CRAN (R 4.2.1)
 memoise         2.0.1   2021-11-26 [1] RSPM (R 4.2.0)
 modelr          0.1.9   2022-08-19 [1] RSPM (R 4.2.0)
 munsell         0.5.0   2018-06-12 [1] RSPM (R 4.2.0)
 parallelly      1.32.1  2022-07-21 [1] RSPM (R 4.2.0)
 pillar          1.8.1   2022-08-19 [1] RSPM (R 4.2.0)
 pkgconfig       2.0.3   2019-09-22 [1] RSPM (R 4.2.0)
 png             0.1-7   2013-12-03 [1] RSPM (R 4.2.0)
 purrr         * 0.3.5   2022-10-06 [1] RSPM (R 4.2.0)
 quanteda      * 3.2.3   2022-08-29 [1] RSPM (R 4.2.0)
 R6              2.5.1   2021-08-19 [1] RSPM (R 4.2.0)
 rappdirs        0.3.3   2021-01-31 [1] RSPM (R 4.2.0)
 Rcpp            1.0.9   2022-07-08 [1] RSPM (R 4.2.0)
 RcppParallel    5.1.5   2022-01-05 [1] RSPM (R 4.2.0)
 readr         * 2.1.3   2022-10-01 [1] RSPM (R 4.2.0)
 readxl          1.4.1   2022-08-17 [1] RSPM (R 4.2.0)
 reprex          2.0.2   2022-08-17 [1] RSPM (R 4.2.0)
 rlang         * 1.0.6   2022-09-24 [1] RSPM (R 4.2.0)
 rmarkdown       2.17    2022-10-07 [1] RSPM (R 4.2.0)
 rstudioapi      0.14    2022-08-22 [1] RSPM (R 4.2.0)
 rvest           1.0.3   2022-08-19 [1] RSPM (R 4.2.0)
 scales          1.2.1   2022-08-20 [1] RSPM (R 4.2.0)
 sessioninfo     1.2.2   2021-12-06 [1] RSPM (R 4.2.0)
 SnowballC     * 0.7.0   2020-04-01 [1] RSPM (R 4.2.0)
 stopwords       2.3     2021-10-28 [1] RSPM (R 4.2.0)
 stringi         1.7.8   2022-07-11 [1] RSPM (R 4.2.0)
 stringr       * 1.4.1   2022-08-20 [1] RSPM (R 4.2.0)
 textdata        0.4.4   2022-09-02 [1] RSPM (R 4.2.0)
 tibble        * 3.1.8   2022-07-22 [1] RSPM (R 4.2.0)
 tidyr         * 1.2.1   2022-09-08 [1] RSPM (R 4.2.0)
 tidyselect      1.2.0   2022-10-10 [1] RSPM (R 4.2.0)
 tidytext      * 0.3.4   2022-08-20 [1] RSPM (R 4.2.0)
 tidyverse     * 1.3.2   2022-07-18 [1] RSPM (R 4.2.0)
 tokenizers      0.2.3   2022-09-23 [1] RSPM (R 4.2.0)
 tzdb            0.3.0   2022-03-28 [1] RSPM (R 4.2.0)
 utf8            1.2.2   2021-07-24 [1] RSPM (R 4.2.0)
 vctrs           0.5.0   2022-10-22 [1] RSPM (R 4.2.0)
 withr           2.5.0   2022-03-03 [1] RSPM (R 4.2.0)
 xfun            0.34    2022-10-18 [1] RSPM (R 4.2.0)
 xml2            1.3.3   2021-11-30 [1] RSPM (R 4.2.0)
 yaml            2.3.6   2022-10-18 [1] RSPM (R 4.2.0)

 [1] /usr/local/lib/R/site-library
 [2] /usr/local/lib/R/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────